Abstract. Given a set X of "empirical" points, whose coordinates are perturbed
by errors, we analyze whether it contains redundant information, that is,
whether some of its elements could be represented by a single equivalent point.
If this is the case, the empirical information associated with X could be
described by fewer points, chosen in a suitable way. We present two different
methods to reduce the cardinality of X which compute a new set of points
equivalent to the original one, that is, representing the same empirical
information. Though our algorithms use some basic notions of Cluster Analysis,
they are specifically designed for "thinning out" redundant data. We include
some experimental results which illustrate the practical effectiveness of our
methods.



1. Introduction 

Often numerical data in scientific computing arise from real-world measurements,
and so are perturbed by noise, uncertainty and approximation. A common technique
to counter this phenomenon is to make "excessively many" measurements, and as a
consequence the resulting body of empirical data appears as a "redundant" set
carrying relatively little information compared to its cardinality. Our aim is
to reduce this redundancy by replacing subsets of close values, which we regard
as repeat measurements, by a single representative value.

We view an empirical point $(p, \varepsilon)$ as a "cloud" of data which differ
from $p$ by less than the tolerance $\varepsilon$. If the intersection of
different clouds is "sufficiently" large, we can replace them by a single
empirical point carrying essentially the same empirical information. We
illustrate this intuitive idea in the following example where an initial set of
12 points is "thinned out" to an equivalent set of 4 points.

Example 1. Given the set $X \subset \mathbb{R}^2$ of 12 points

$$X = \{(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0),
        (-1,1), (0,1), (1,1), (5,-2.9), (5,0), (5,2.9)\}$$

we suppose that each coordinate is perturbed by an error less than 1.43.

Figure 1. Appropriate partition of X



In this situation, the first nine points most likely derive from measurements
of the same quantity; therefore it is quite reasonable (and appropriate) to
collapse them onto a single candidate, for example the point (0, 0). In
contrast, since the last three points are well separated, they should not be
collapsed. This partition, shown in Figure 1, is found by our algorithms, as
reported in Examples 2 and 3.

Based on the idea of clustering together empirical points which could derive 
from different measurements of the same datum, we have designed two algorithms 
which take a large set of redundant data and produce a smaller set of "equivalent" 
empirical points. Typically the smaller set contains far fewer elements than the 
original one, with obvious consequent gains both in computational speed and in 
memory resources used in subsequent processing of the data. 

This paper is organized as follows. In Section 2 we introduce the concepts
and tools useful to our work, focussing our attention on the idea of
"collapsable sets" of empirical points. Section 3 describes the Agglomerative
and the Divisive Algorithms to thin out sets of empirical points while
preserving the overall geometrical structure. The relationship with the theory
of Cluster Analysis is discussed in Section 4. In Section 5 we present some
numerical examples to illustrate the behaviour of our algorithms on different
geometrical configurations of points. The conclusions are summarized in
Section 6.

2. Basic Definitions and Notation 

This section recalls the definitions and tools used later in the paper. 

We suppose that the points belong to the space $\mathbb{R}^n$, $n \ge 1$, and
we use the norm $\|\cdot\|_2$. Further, given an $n \times n$ positive diagonal
matrix $E$, we shall also use the weighted norm $\|\cdot\|_{E,2}$ as defined
in [2]. For completeness, we recall here their definitions:

$$\|v\|_2 = \sqrt{\sum_{j=1}^{n} v_j^2}
  \qquad\text{and}\qquad
  \|v\|_{E,2} = \|Ev\|_2 .$$

Later on the index 2 will be omitted for simplicity of notation.

Intuitively an empirical point, representing real-world measurements, is a
point $p$ of $\mathbb{R}^n$ whose coordinates are affected by noise, while only
an error estimation on them is known. We suppose we know (for every
$1 \le i \le n$) an estimate $\varepsilon_i \in \mathbb{R}^+$ of the error in
the $i$-th component of $p$, so that each point $r$ which differs from $p$
componentwise by less than $\varepsilon_i$ can be considered equivalent to $p$
from a numerical point of view. We can formalize this idea by means of the
definition of empirical point, introduced by Stetter in [7].

Definition 2.1. Let $p \in \mathbb{R}^n$ be a point and let
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$, with each
$\varepsilon_i \in \mathbb{R}^+$, be the vector of the componentwise estimated
errors. An empirical point $p^\varepsilon$ is the pair $(p, \varepsilon)$,
where we call $p$ the specified value and $\varepsilon$ the tolerance.

In this paper we shall consider sets of empirical points all having the same 
fixed tolerance e. This is a natural assumption if the points derive from real-world 
data measured with the same accuracy. Additionally, this hypothesis simplifies the 
theoretical study. 

From now on we denote by $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$,
with each $\varepsilon_i \in \mathbb{R}^+$, the fixed tolerance. So given any
$p \in \mathbb{R}^n$, we write $p^\varepsilon$ to mean the corresponding
empirical point having $p$ as specified value and $\varepsilon$ as tolerance.
We denote by $X^\varepsilon = \{p_1^\varepsilon, \dots, p_s^\varepsilon\}$ a
set of empirical points each having the tolerance $\varepsilon$ and by
$X = \{p_1, \dots, p_s\}$ the set of the specified values associated to
$X^\varepsilon$. We define the diagonal matrix
$E = \mathrm{diag}(1/\varepsilon_1, \dots, 1/\varepsilon_n)$ and shall use the
$E$-weighted norm on $\mathbb{R}^n$ in order to "normalize" the distance
between points w.r.t. the tolerance $\varepsilon$.

An empirical point naturally defines the following set:

$$N(p^\varepsilon) = \{ r \in \mathbb{R}^n : \|p - r\|_E < 1 \}$$

Each element in $N(p^\varepsilon)$ can be obtained by perturbing the
coordinates of the specified value $p$ by amounts less than the tolerance; for
this reason we can say that the points of $N(p^\varepsilon)$ represent the same
empirical information as $p^\varepsilon$. Analogously, each element of
$\bigcap_{p \in X} N(p^\varepsilon)$, if this intersection is not empty,
represents the same empirical information as the whole set $X^\varepsilon$.
Although the choice of an element in this intersection is quite free, we decide
to represent a set of "close" points with their centroid. The following
definitions are introduced in order to formalize this idea.
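As a small illustration of the set $N(p^\varepsilon)$, the following Python
sketch tests membership of a point in one such set and in the intersection over
a whole set of empirical points. The function names (`weighted_norm`,
`in_neighbourhood`, `in_common_neighbourhood`) are hypothetical and chosen only
for this illustration; the data are the points of Example 1.

```python
import math

def weighted_norm(v, eps):
    """E-weighted 2-norm ||E v||_2 with E = diag(1/eps_1, ..., 1/eps_n)."""
    return math.sqrt(sum((vi / ei) ** 2 for vi, ei in zip(v, eps)))

def in_neighbourhood(r, p, eps):
    """True if r lies in N(p^eps), i.e. ||p - r||_E < 1."""
    diff = [pi - ri for pi, ri in zip(p, r)]
    return weighted_norm(diff, eps) < 1

def in_common_neighbourhood(r, points, eps):
    """True if r lies in the intersection of all N(p^eps), p in points."""
    return all(in_neighbourhood(r, p, eps) for p in points)

eps = (1.43, 1.43)
# The nine grid points of Example 1.
grid = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]

print(in_common_neighbourhood((0, 0), grid, eps))  # True
print(in_neighbourhood((0, 0), (5, 0), eps))       # False
```

The point (0, 0) lies in every $N(p^\varepsilon)$ of the nine grid points, so it
could represent all of them, while it is far from the cloud around (5, 0).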

Definition 2.2. The set of empirical points
$X^\varepsilon = \{p_1^\varepsilon, \dots, p_s^\varepsilon\}$ is collapsable if

$$\|p_i - q\|_E < 1 \qquad \forall\, i = 1, \dots, s \qquad\qquad (2.1)$$

where $q = \frac{1}{s}\sum_{i=1}^{s} p_i$ is the centroid of $X$.

If $X^\varepsilon$ is collapsable, the centroid $q$ of $X$ belongs to each of
the sets $N(p_i^\varepsilon)$; so the empirical point $q^\varepsilon$ is
numerically equivalent to every point in $X^\varepsilon$. We formalize this
idea as follows.

Definition 2.3. The empirical centroid of a set $X^\varepsilon$ is the
empirical point $q^\varepsilon$ where $q$ is the centroid of the set $X$. If
$X^\varepsilon$ is a collapsable set, its empirical centroid is called
its valid representative.
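Definitions 2.2 and 2.3 translate directly into a short test. The sketch below
is only an illustration under our own naming (`centroid`, `weighted_dist`,
`is_collapsable` and `valid_representative` are hypothetical helpers); it
checks condition (2.1) on the points of Example 1.

```python
import math

def centroid(points):
    """Componentwise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def weighted_dist(p, q, eps):
    """||p - q||_E with E = diag(1/eps_1, ..., 1/eps_n)."""
    return math.sqrt(sum(((a - b) / e) ** 2 for a, b, e in zip(p, q, eps)))

def is_collapsable(points, eps):
    """Condition (2.1): every point is closer than 1 to the centroid."""
    q = centroid(points)
    return all(weighted_dist(p, q, eps) < 1 for p in points)

def valid_representative(points, eps):
    """Specified value of the valid representative, or None if not collapsable."""
    return centroid(points) if is_collapsable(points, eps) else None

eps = (1.43, 1.43)
grid = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]
far = [(5, -2.9), (5, 0), (5, 2.9)]

print(valid_representative(grid, eps))        # (0.0, 0.0)
print(valid_representative(grid + far, eps))  # None
```

The nine grid points form a collapsable set with centroid (0, 0), whereas the
full set of twelve points does not satisfy (2.1).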

If a set of empirical points contains a collapsable subset, it contains some
redundancy, i.e. it carries relatively little empirical information compared to
the number of points in it. The methods presented in this paper are designed to
"thin out" such sets by finding a smaller set of empirical points with much
lower redundancy which still contains essentially the same empirical
information.

3. Algorithms 

In this section we describe two algorithms that, given a set $X^\varepsilon$ of
empirical points, compute a partition
$\mathcal{L}^\varepsilon = \{L_1^\varepsilon, \dots, L_k^\varepsilon\}$ of it,
consisting of non-empty collapsable sets, and a set
$Y^\varepsilon = \{q_1^\varepsilon, \dots, q_k^\varepsilon\}$ where each
$q_i^\varepsilon$ is the valid representative of $L_i^\varepsilon$.
Our algorithms differ in the strategies for building the partitions:

1. the Agglomerative Algorithm initially puts each point of $X^\varepsilon$
into a different subset and then iteratively unifies pairs of subsets into a
larger collapsable set;

2. the Divisive Algorithm initially puts all the points of $X^\varepsilon$ into
a single subset and then iteratively splits off the remotest outlier and
"evens up" the new partition.

3.1. The Agglomerative Algorithm 

The Agglomerative Algorithm (AA) implements a unifying method. The sets in 
the partition are determined by an iterative process. Initially each set contains 
a single original empirical point, then iteratively the two closest sets are unified 
provided their union is collapsable. This method is quite fast when the input points 
are well separated w.r.t. the tolerance, since a small number of set unifications is 
required. 

Theorem 3.1. (The Agglomerative Algorithm)

Let $X^\varepsilon = \{p_1^\varepsilon, \dots, p_s^\varepsilon\}$ be a set of
empirical points, with each $p_i \in \mathbb{R}^n$ and a common tolerance
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$. Let $\|\cdot\|_E$ be the
weighted norm on $\mathbb{R}^n$ w.r.t.
$E = \mathrm{diag}(1/\varepsilon_1, \dots, 1/\varepsilon_n)$. Consider the
following sequence of instructions.

AA1: Start with the subset list $\mathcal{L} = [L_1, \dots, L_s]$ where each
$L_i = \{p_i\}$, and the list $Y = [q_1, \dots, q_s]$ of the centroids of
the $L_i$.

AA2: Compute the symmetric matrix $M = (m_{ij})$ such that
$m_{ij} = \|q_i - q_j\|_E$ for each $q_i, q_j \in Y$.

AA3: If $|Y| = 1$ or $\min\{m_{ij} : i < j\} > 2$ then return the lists
$\mathcal{L}$ and $Y$ and stop.

AA4: Choose $\hat\imath, \hat\jmath$ s.t.
$m_{\hat\imath\hat\jmath} = \min\{m_{ij} : i < j\}$ and compute the centroid
$q$ of $L_{\hat\imath} \cup L_{\hat\jmath}$:

$$q = \frac{|L_{\hat\imath}|\, q_{\hat\imath} + |L_{\hat\jmath}|\, q_{\hat\jmath}}
           {|L_{\hat\imath}| + |L_{\hat\jmath}|}$$

AA5: If $\|p - q\|_E < 1$ for every $p \in L_{\hat\imath} \cup L_{\hat\jmath}$
then in $\mathcal{L}$ replace $L_{\hat\imath}$ by
$L_{\hat\imath} \cup L_{\hat\jmath}$ and remove $L_{\hat\jmath}$. Similarly, in
$Y$ replace $q_{\hat\imath}$ by $q$ and remove $q_{\hat\jmath}$, and then go to
step AA2. Otherwise put $m_{\hat\imath\hat\jmath} = \infty$ (any value greater
than 2 will do) and go to step AA3.

This algorithm computes a pair $(\mathcal{L}, Y)$ such that:

• $\{L_i^\varepsilon : L_i \in \mathcal{L}\}$ is a partition of $X^\varepsilon$
into collapsable sets such that no pair can be unified into a collapsable set;

• for each $q_i \in Y$ the empirical point $q_i^\varepsilon$ is the valid
representative of $L_i^\varepsilon$.

Proof. First we prove finiteness. Step AA2 is performed only finitely many
times and so a finite number of matrices $M$ is computed. In fact, after the
first computation of $M$, this step is performed only when the algorithm
removes an element from $Y$, i.e. at most $s - 1$ times. Now, also step AA4 is
performed only finitely many times on the same matrix $M$, because it is
performed only when the minimal element $m_{\hat\imath\hat\jmath}$ of the
matrix $M$ is less than or equal to 2 and then either two subsets are unified
or $m_{\hat\imath\hat\jmath}$ is replaced by $\infty$, but this can happen at
most $s^2/2$ times.

Next we show correctness. First, note that the elements of $\mathcal{L}$ define
a partition of $X$. In fact, in step AA1 we set
$\mathcal{L} = [\{p_1\}, \dots, \{p_s\}]$; the only place where $\mathcal{L}$
changes is in step AA5 when we unite two of its elements, and so a new
partition of $X$ is obtained. Obviously $\mathcal{L}^\varepsilon$ is also a
partition of $X^\varepsilon$.

For each $L_i \in \mathcal{L}$, the corresponding empirical set
$L_i^\varepsilon$ is collapsable. This is clearly true in step AA1. Step AA5
unites two elements of $\mathcal{L}$ only if their union is collapsable: step
AA4 computes the centroid $q$ of $L_i \cup L_j$ and step AA5 tests condition
(2.1) for each point in $L_i \cup L_j$.

Now we prove that upon termination the union of any pair of elements of
$\mathcal{L}$ is not collapsable. If the algorithm stops because $Y$ (and
$\mathcal{L}$ too) contains a single element, the conclusion is trivial.
Otherwise, the algorithm ends because $m_{ij} > 2$ for all $i < j$. We observe
that the elements $m_{ij}$ of the final matrix $M$ are such that either
$m_{ij} = \|q_i - q_j\|_E$ or $m_{ij} = \infty$ but $\|q_i - q_j\|_E \le 2$.
The case where $m_{ij} = \infty$ is trivial: an entry in $M$ can become
$\infty$ only in step AA5 after having verified that
$L_i^\varepsilon \cup L_j^\varepsilon$ is not collapsable. In the case where
$m_{ij}$ is finite we show by contradiction that the union of
$L_i^\varepsilon$ and $L_j^\varepsilon$ is not a collapsable set. We suppose
that $\|p - q\|_E < 1$ for each $p \in L_i \cup L_j$, where $q$ is the centroid
of $L_i \cup L_j$. If $m = |L_i|$ and $n = |L_j|$, we have

$$\|q_i - q_j\|_E \le \|q_i - q\|_E + \|q - q_j\|_E
  = \Big\| \frac{1}{m} \sum_{p \in L_i} (p - q) \Big\|_E
  + \Big\| \frac{1}{n} \sum_{p \in L_j} (p - q) \Big\|_E
  \le \frac{1}{m} \sum_{p \in L_i} \|p - q\|_E
  + \frac{1}{n} \sum_{p \in L_j} \|p - q\|_E .$$

From the hypothesis, we deduce that $\|q_i - q_j\|_E < 2$, a contradiction.

Finally, we can conclude the proof since, by construction, each element
$q_i \in Y$ is the centroid of $L_i$ and $L_i^\varepsilon$ is collapsable, so
the empirical centroid $q_i^\varepsilon$ is indeed the valid representative of
$L_i^\varepsilon$. □

Note that, in step AA5, we must check the condition that $\|p - q\|_E < 1$ for
each $p \in L_{\hat\imath} \cup L_{\hat\jmath}$. In fact, if we check only the
condition $\|q_{\hat\imath} - q_{\hat\jmath}\|_E < 1$, there are pathological
examples where sets that are not collapsable are built in the final partition
(see Example 8).

The algorithm as presented here can easily be improved from the computational
point of view: in step AA2 it is not necessary to compute a new matrix $M$
after uniting $L_{\hat\imath}$ and $L_{\hat\jmath}$; it suffices to remove the
$\hat\jmath$-th column and to update the $\hat\imath$-th row.

In the following example we apply the Agglomerative Algorithm to the points
of Example 1 to show that the desired partition is obtained (see Figure 1).

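Example 2. Let $X^\varepsilon = \{p_1^\varepsilon, \dots, p_{12}^\varepsilon\}$
be a set of empirical points with tolerance $\varepsilon = (1.43, 1.43)$, whose
specified values coincide with the set $X$ of Example 1:

$$X = \{(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0),
        (-1,1), (0,1), (1,1), (5,-2.9), (5,0), (5,2.9)\}$$

The AA computes, at each step, the following partitions, only clustering
together the first nine points.

1. $\mathcal{L} = \{\{p_1\}, \{p_2\}, \{p_3\}, \{p_4\}, \{p_5\}, \{p_6\}, \{p_7\}, \{p_8\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

2. $\mathcal{L} = \{\{p_1, p_2\}, \{p_3\}, \{p_4\}, \{p_5\}, \{p_6\}, \{p_7\}, \{p_8\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

3. $\mathcal{L} = \{\{p_1, p_2, p_4\}, \{p_3\}, \{p_5\}, \{p_6\}, \{p_7\}, \{p_8\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

4. $\mathcal{L} = \{\{p_1, p_2, p_4\}, \{p_3, p_6\}, \{p_5\}, \{p_7\}, \{p_8\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

5. $\mathcal{L} = \{\{p_1, p_2, p_4\}, \{p_3, p_6\}, \{p_5, p_8\}, \{p_7\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

6. $\mathcal{L} = \{\{p_1, p_2, p_4, p_5, p_8\}, \{p_3, p_6\}, \{p_7\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

7. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_8\}, \{p_7\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

8. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8\}, \{p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

9. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$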
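The steps AA1 to AA5 can be sketched in a few lines of Python. This is a
minimal illustration and not the CoCoALib implementation: the function names
are hypothetical, the matrix $M$ is simply recomputed after every merge, and
ties between equally close pairs are broken by index order (an assumption; the
theorem leaves the choice open). Intermediate merges may therefore occur in a
different order from the trace above, but on the data of Example 1 the final
partition has the same four sets.

```python
import math
from itertools import combinations

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def wdist(p, q, eps):
    """Weighted distance ||p - q||_E."""
    return math.sqrt(sum(((a - b) / e) ** 2 for a, b, e in zip(p, q, eps)))

def is_collapsable(points, eps):
    q = centroid(points)
    return all(wdist(p, q, eps) < 1 for p in points)

def agglomerative(points, eps):
    clusters = [[p] for p in points]                        # AA1
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        # AA2: all pairwise centroid distances, closest pair first.
        pairs = sorted(
            (wdist(cents[i], cents[j], eps), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        merged = False
        for d, i, j in pairs:
            if d > 2:                                       # AA3: nothing left to try
                break
            union = clusters[i] + clusters[j]               # AA4
            if is_collapsable(union, eps):                  # AA5: full test on the union
                clusters[i] = union
                del clusters[j]
                merged = True
                break
            # otherwise this pair is skipped (plays the role of m_ij = infinity)
        if not merged:
            break
    return clusters, [centroid(c) for c in clusters]

eps = (1.43, 1.43)
X = [(-1, -1), (0, -1), (1, -1), (-1, 0), (0, 0), (1, 0),
     (-1, 1), (0, 1), (1, 1), (5, -2.9), (5, 0), (5, 2.9)]
clusters, reps = agglomerative(X, eps)
print(sorted(len(c) for c in clusters))  # [1, 1, 1, 9]
```

Recomputing all distances after each merge keeps the sketch short; the
incremental update of $M$ described above would avoid that cost.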

3.2. The Divisive Algorithm 

The Divisive Algorithm (DA) implements a "subdivision" method. The sets in the 
partition are determined by an iterative process. Initially the partition consists 
of a single set containing all the points. Then iteratively DA seeks the original 
point farthest from the centroid of its set. If the distance between them is below 
the tolerance threshold then the algorithm stops, because all original points are 
sufficiently well represented by the centroids of their sets. Otherwise it splits off 
the worst represented original point into a new set initially containing just itself. 
Then DA proceeds with a redistribution phase with the aim of associating each
original point to the current best representative subset (locally) minimizing
the total central sum of squares, defined as follows [B].



Definition 3.2. Let $X$ be a subset of $\mathbb{R}^n$ and let $q$ be its
centroid. The central sum of squares of $X$ is defined to be:

$$I(X) = \sum_{p \in X} \|p - q\|_E^2$$

Definition 3.3. Let $\mathcal{L} = \{L_1, \dots, L_k\}$ be a partition of the
set $X$. The total central sum of squares of the partition $\mathcal{L}$ is
defined to be:

$$I(\mathcal{L}) = \sum_{j=1}^{k} I_j$$

where $I_j$ is the central sum of squares of $L_j$.

If $X^\varepsilon$ contains large subsets of close empirical points, DA turns
out to be more efficient than AA, since a smaller number of subdivisions is
required.

Theorem 3.4. (The Divisive Algorithm)

Let $X^\varepsilon = \{p_1^\varepsilon, \dots, p_s^\varepsilon\}$ be a set of
empirical points, with each $p_i \in \mathbb{R}^n$ and a common tolerance
$\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$. Let $\|\cdot\|_E$ be the
weighted norm on $\mathbb{R}^n$ w.r.t.
$E = \mathrm{diag}(1/\varepsilon_1, \dots, 1/\varepsilon_n)$. Consider the
following sequence of instructions.

DA1: Start with the list $\mathcal{L} = [L_1]$ where $L_1 = X$, and the
centroid list $Y = [q_1]$ where $q_1$ is the centroid of $L_1$.

DA2: Let $\mathcal{L} = [L_1, \dots, L_r]$ and $Y = [q_1, \dots, q_r]$, the
centroid list of the elements of $\mathcal{L}$. For each $p_i \in X$ set
$d_i = \|p_i - q_j\|_E$ where $L_j$ is the subset (of $X$) to which $p_i$
belongs. Build the list $D = [d_1, \dots, d_s]$.

DA3: If $\max(D) < 1$ then return the lists $\mathcal{L}$ and $Y$, and stop.

DA4: Choose an index $\hat\imath$ such that $d_{\hat\imath} = \max(D)$, and
compute the index $\hat\jmath$ of the subset $L_{\hat\jmath}$ to which
$p_{\hat\imath}$ belongs. Remove $p_{\hat\imath}$ from $L_{\hat\jmath}$ and
compute the new centroid $q_{\hat\jmath}$ of $L_{\hat\jmath}$; append
$L_{r+1} = \{p_{\hat\imath}\}$ to $\mathcal{L}$ and $q_{r+1} = p_{\hat\imath}$
to $Y$.

DA5: Compute the total central sum of squares $I(\mathcal{L})$ of the new
partition $\mathcal{L}$.

DA6: For each $p \in X$ and for each $L_k \in \mathcal{L}$, denote by
$\mathcal{L}_{p,k}$ the partition $\mathcal{L}$ but with $p$ moved into $L_k$.
Compute the total central sum of squares $I(\mathcal{L}_{p,k})$.

DA7: Choose a point $\hat p \in X$ and an index $\hat k$ s.t.

$$I(\mathcal{L}_{\hat p,\hat k}) =
  \min\{ I(\mathcal{L}_{p,k}) : p \in X,\ L_k \in \mathcal{L} \}$$

DA8: If $I(\mathcal{L}_{\hat p,\hat k}) \ge I(\mathcal{L})$ then go to DA2.
Otherwise set $\mathcal{L} = \mathcal{L}_{\hat p,\hat k}$. Compute the
centroids of the new partition $\mathcal{L}$. Go to DA5.

This algorithm computes a pair $(\mathcal{L}, Y)$ such that:

• $\{L_i^\varepsilon : L_i \in \mathcal{L}\}$ is a partition of $X^\varepsilon$
into collapsable sets;

• for each $q_i \in Y$, the empirical point $q_i^\varepsilon$ is the valid
representative of $L_i^\varepsilon$.

Proof. Later on we shall refer to the loop DA5-DA8 as "the redistribution
phase": points are moved from one subset to another in order to strictly
decrease the total central sum of squares. Note that in the redistribution
phase the cardinality of $\mathcal{L}$ does not change as the algorithm never
eliminates any set in $\mathcal{L}$. Indeed, if the singleton set
$L_j = \{p\}$ belongs to $\mathcal{L}$, the point $p$ will not be moved to
another set $L_k \in \mathcal{L}$ leaving $L_j$ empty, since this new
configuration cannot have smaller total central sum of squares: the combined
central sum of squares of the sets $L_j = \{p\}$ and $L_k$ is

$$I_j + I_k = 0 + \sum_{r \in L_k} \|r - q_k\|_E^2$$

where $q_k$ is the centroid of $L_k$, whereas the combined central sum of
squares of the new sets $L'_j = \emptyset$ and $L'_k = L_k \cup \{p\}$ is

$$I'_j + I'_k = 0 + \Big( \sum_{r \in L_k} \|r - q'_k\|_E^2 \Big)
              + \|p - q'_k\|_E^2$$

where $q'_k$ is the centroid of $L'_k = L_k \cup \{p\}$. And since $q_k$ is the
centroid of $L_k$, we have
$\sum_{r \in L_k} \|r - q'_k\|_E^2 \ge \sum_{r \in L_k} \|r - q_k\|_E^2$.
Consequently the new total central sum of squares cannot be smaller.

Now we prove finiteness. The algorithm comprises two nested loops: the outer
loop spanning steps DA2-DA8, and the redistribution phase (steps DA5-DA8).
The outer loop cannot perform more than $s$ iterations because step DA4 can be
performed at most $s$ times; anyway, after $s$ iterations the termination
criterion in step DA3 will surely be satisfied as all the $d_i$ would be zero.

The redistribution loop will perform only finitely many iterations. Each
iteration strictly reduces the total central sum of squares, and since $X$ is
finite it has only finitely many partitions. Consequently there are only
finitely many possible values for the total central sum of squares.

Next we show correctness. The elements of $\mathcal{L}$ define a partition of
$X$. This is trivially true in step DA1. The creation of a new subset in step
DA4 clearly maintains the property. The redistribution phase merely moves
points between subsets (in step DA8), so also preserves the property.

The test in step DA3 guarantees that upon completion of the algorithm each
$L_i \in \mathcal{L}$ corresponds to a collapsable set $L_i^\varepsilon$. By
construction, each element $q_i \in Y$ is the centroid of $L_i$. Thus
$q_i^\varepsilon$ is the valid representative of $L_i^\varepsilon$. □

In the following example we apply the Divisive Algorithm to the points of
Example 1 to show that the desired partition is obtained (see Figure 1).

Example 3. Let $X^\varepsilon = \{p_1^\varepsilon, \dots, p_{12}^\varepsilon\}$
be a set of empirical points with tolerance $\varepsilon = (1.43, 1.43)$, whose
specified values coincide with the set $X$ of Example 1:

$$X = \{(-1,-1), (0,-1), (1,-1), (-1,0), (0,0), (1,0),
        (-1,1), (0,1), (1,1), (5,-2.9), (5,0), (5,2.9)\}$$

The DA computes, at each step, after the redistribution phase, the following
partitions.

1. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9, p_{10}, p_{11}, p_{12}\}\}$

2. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}, \{p_{10}, p_{11}, p_{12}\}\}$

3. $\mathcal{L} = \{\{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}, \{p_{10}\}, \{p_{11}, p_{12}\}\}$

4. $\mathcal{L} = \{\{p_1, p_2, p_4, p_5, p_8, p_3, p_6, p_7, p_9\}, \{p_{10}\}, \{p_{11}\}, \{p_{12}\}\}$

As mentioned before, DA performs fewer iterations than AA since several input
points are close together w.r.t. the tolerance.
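The steps DA1 to DA8 can likewise be sketched in Python. The code below is an
illustration only, with hypothetical function names; it assumes that the
central sum of squares is taken in the $E$-weighted norm, and it skips moves
that would empty a set (the proof of Theorem 3.4 shows such moves never
decrease the total). To keep the run easy to follow by hand, it is applied to a
small one-dimensional set with tolerance 0.5.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def wdist2(p, q, eps):
    """Squared weighted distance ||p - q||_E^2."""
    return sum(((a - b) / e) ** 2 for a, b, e in zip(p, q, eps))

def total_css(clusters, eps):
    """Total central sum of squares of a partition (weighted norm assumed)."""
    return sum(wdist2(p, centroid(c), eps) for c in clusters for p in c)

def best_move(clusters, eps):
    """DA6-DA7: best single-point move as (value, new_partition), or None."""
    best = None
    for src, c in enumerate(clusters):
        if len(c) == 1:                      # moving p would leave an empty set
            continue
        for p in c:
            for dst in range(len(clusters)):
                if dst == src:
                    continue
                trial = [list(x) for x in clusters]
                trial[src].remove(p)
                trial[dst].append(p)
                value = total_css(trial, eps)
                if best is None or value < best[0]:
                    best = (value, trial)
    return best

def divisive(points, eps):
    clusters = [list(points)]                                    # DA1
    while True:
        cents = [centroid(c) for c in clusters]
        d, ci, worst = max(                                      # DA2
            ((math.sqrt(wdist2(p, cents[i], eps)), i, p)
             for i, c in enumerate(clusters) for p in c),
            key=lambda t: t[0],
        )
        if d < 1:                                                # DA3
            return clusters, cents
        clusters[ci].remove(worst)                               # DA4
        clusters.append([worst])
        while True:                                              # DA5-DA8
            move = best_move(clusters, eps)
            if move is None or move[0] >= total_css(clusters, eps):
                break
            clusters = move[1]

eps = (0.5,)
X = [(0,), (0.05,), (0.9,), (1,), (1.2,)]
clusters, reps = divisive(X, eps)
print(clusters)  # [[(0.9,), (1,), (1.2,)], [(0,), (0.05,)]]
```

Here a single split suffices: the point 0 is split off, the redistribution
phase then moves 0.05 next to it, and all remaining distances fall below 1.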

3.3. A particularly quick method: the grid algorithm 

We recall the $\infty$-norm and its corresponding $E$-weighted norm on
$\mathbb{R}^n$, see [2]:

$$\|v\|_\infty = \max_{i=1,\dots,n} |v_i|
  \qquad\text{and}\qquad
  \|v\|_{E,\infty} = \|Ev\|_\infty$$

where $E = \mathrm{diag}(1/\varepsilon_1, \dots, 1/\varepsilon_n)$, as before.

A particularly quick method for decreasing the cardinality of the set
$X^\varepsilon$ can be designed using a regular grid, consisting of half-open
balls of radius 1/2 w.r.t. the $E$-weighted norm $\|\cdot\|_{E,\infty}$. We
arbitrarily choose one ball to have the origin as its centre, then tessellate
to cover the whole space.

This algorithm computes a partition of $X^\varepsilon$ by gathering all the
empirical points whose specified values lie in the same ball into the same
subset. Suppose that one of these subsets comprises the empirical points
$p_1^\varepsilon, \dots, p_k^\varepsilon$, and let $q^\varepsilon$ be their
empirical centroid; then $q^\varepsilon$ is a "good" representative of each
$p_i^\varepsilon$ because

$$\|p_i - q\|_{E,\infty}
  = \Big\| p_i - \frac{1}{k} \sum_{j=1}^{k} p_j \Big\|_{E,\infty}
  \le \frac{1}{k} \sum_{j=1}^{k} \|p_i - p_j\|_{E,\infty} < 1 .$$

However, in general such a subset is not collapsable, a notion defined in terms
of the 2-norm.

Note that, since the separations of the empirical points are ignored by this 
method, unsatisfactory partitions can be obtained, e.g. close points may happen to 
belong to different balls and so be assigned to different subsets. Nevertheless, this 
drawback is compensated by the speed and simplicity of the method. In particular, 
this grid method (with a smaller radius) can be used to reduce the bulk of a 
very large body of data before applying one of the more sophisticated but slower 
algorithms, AA or DA. Another application of the grid method is to help choose 
the more suitable algorithm between AA and DA by estimating the numbers of 
sets in the partitions which would be produced. 
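The grid method amounts to bucketing the points by the index of the cell that
contains them. The following Python sketch is an illustration under our own
conventions: `grid_thin` is a hypothetical name, the `radius` parameter is the
weighted radius of the balls, and a cell is taken to be half-open with its
lower boundary included (the text does not say which side is closed).

```python
import math
from collections import defaultdict

def grid_thin(points, eps, radius=0.5):
    """Group points by grid cell and return {cell index: centroid}.

    Cells are half-open boxes of weighted side 2*radius, one of which
    is centred on the origin.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(
            math.floor(x / (2 * radius * e) + 0.5) for x, e in zip(p, eps)
        )
        cells[key].append(p)
    return {
        key: tuple(sum(c) / len(group) for c in zip(*group))
        for key, group in cells.items()
    }

eps = (1, 1)
pts = [(0.1, 0.1), (0.4, -0.4), (0.6, 0.1)]
reps = grid_thin(pts, eps)
print(sorted(reps))  # [(0, 0), (1, 0)]
```

The third point is close to the first one, yet it falls into the neighbouring
cell, which illustrates the drawback mentioned above. The number of cells
`len(reps)` gives a cheap estimate of the size of the final partition.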



4. Relationship with Cluster Analysis 

The idea of analyzing a large body of empirical data and of partitioning it
into sets of "similar values" has been well studied in the theory of Cluster
Analysis (e.g. see [J]). The overall aim of Cluster Analysis is to separate the
original data into clusters where the members of each cluster are much more
similar to each other than to members of other clusters. In contrast, our
methods are more concerned with thinning out groups of very close values while
ignoring more distant points. Below we show how Ward's "classical" algorithm
[6], an agglomerative hierarchical method, and Li's more recent algorithm [5],
a divisive hierarchical method, partition the empirical points of Example 1.

Example 4. Let $X^\varepsilon$ be the set of empirical points whose set of
specified values is given in Example 1; similarly, let
$\varepsilon = (1.43, 1.43)$ as given there. We recall that in Examples 2 and 3
both our algorithms AA and DA obtained the minimal partition into collapsable
sets, as illustrated in Figure 1.

Ward's and Li's algorithms do not obtain this minimal partition. In fact,
after 8 steps, Ward's algorithm puts the points (5, -2.9) and (5, 0) into the
same cluster, while the first nine points of $X$ still belong to different
clusters. Since this is an agglomerative method no set of points is split
during the computation, so Ward's algorithm fails to recognise the collapsable
set of nine points. In a similar vein, Li's algorithm goes astray at the third
step: it divides the first nine points of $X$ into two subsets while the points
(5, -2.9) and (5, 0) still belong to the same cluster. Since this is a
hierarchical divisive method, once a set is split it can never be joined
together again, so Li's algorithm needlessly splits the collapsable set of
nine points.

Now we consider another method of Cluster Analysis, QT Clustering [5], because
it has a number of similarities to our methods, especially AA. QT Clustering
computes a partition of the input data using a given limit on the diameter of
the clusters. It works by building clusters according to their cardinality,
while we are primarily interested in the local geometrical separations of the
input data.

Example 5. Let $X^\varepsilon$ be a set of empirical points with tolerance
$\varepsilon = (0.5)$ and with specified values
$X = \{0, 0.05, 0.9, 1, 1.2\} \subset \mathbb{R}$. Applying the QT Clustering
algorithm with maximum cluster diameter equal to $2\varepsilon$, we obtain the
partition $\{\{0, 0.05, 0.9, 1\}, \{1.2\}\}$ where
$\{0, 0.05, 0.9, 1\}^\varepsilon$ is not a collapsable set. In contrast, if we
apply AA or DA to $X^\varepsilon$, we obtain the more balanced partition
$\{\{0, 0.05\}, \{0.9, 1, 1.2\}\}$ whose elements consist of specified values
of collapsable sets. We maintain that our partition is more plausible as a
grouping of noisy data.

5. Numerical Tests and Illustrative Examples 

In this section we present some numerical examples to show the effectiveness and 
the potential of our techniques. Both AA and DA have been implemented using 
the C++ language, and are included in CoCoALib P]. All computations in the 
following examples have been performed on an Intel Pentium M735 processor 
(at 1.7 GHz) running GNU/Linux and using the implementation in CoCoALib. 
Example 6. Clouds of empirical points.

In this example we consider an empirical set containing two well separated
empirical points and three clusters, two big and one small. Both AA and DA
compute five valid representatives for $X^\varepsilon$, but because the result
comprises very few points DA is faster than AA.

Let $X^\varepsilon$ be a set of empirical points, with tolerance
$\varepsilon = (20, 20)$ and specified values
$X = \bigcup_{i=1}^{5} X_i \subset \mathbb{R}^2$ where

$X_1$ consists of 82 points lying inside the disk of radius 10 centered on (0, 0),
$X_2$ consists of 64 points lying inside the disk of radius 10 centered on (40, 50),
$X_3 = \{(49, 0), (50, 0), (50, 1)\}$, $X_4 = \{(9, 41)\}$ and $X_5 = \{(-10, 80)\}$.

Both AA and DA compute the "intuitive" partition consisting of 5 subsets
$L_i = X_i$ for $i = 1, \dots, 5$, as shown in Figure 2.

Figure 2. Appropriate partition of X



Example 7. Empirical points close to a circle.

In this example we compare the behaviour of AA and DA on a family of test
cases, which comprises sets of empirical points with similar geometrical
configurations but with differing "densities". Let
$X_1, X_2 \subset \mathbb{R}^2$ be two sets of points lying close to the circle
of radius 200 and centered at the origin. They contain 2504 and 5032 points,
respectively. For simplicity we choose a tolerance
$\varepsilon = (\varepsilon_1, \varepsilon_2)$ with
$\varepsilon_1 = \varepsilon_2$. The numerical tests are performed by applying
both AA and DA to the empirical sets $X_1^\varepsilon$ and $X_2^\varepsilon$
for various values of $\varepsilon$, viz. $\varepsilon_1 = 2^k$ for
$k = 0, \dots, 6$, since, for a fixed set of points, simply increasing
$\varepsilon$ effectively increases the density of the points.

In Table 1 we present the results obtained processing $X_1$ and $X_2$
respectively. The first column contains the value of the tolerance, the columns
labeled with "#VR" contain the number of the valid representatives computed by
AA and DA respectively, while those labeled with "Time" show the timings (in
seconds) of each algorithm. The results show that DA runs quickly if
$\varepsilon$ is large, that is when the set of empirical points is dense
enough, since only a few splittings of the original set are needed. On the
other hand, when the points are well separated, AA is preferable since the
final partition consists of a large number of sets.

            X_1, AA         X_1, DA         X_2, AA         X_2, DA
  ε       #VR    Time     #VR    Time     #VR    Time     #VR    Time
  1         …       …       …       …       …       …       …       …
  2         …       …       …       …       …       …       …       …
  4         …       …       …       …       …       …     185   577 s
  8       108    18 s      87    66 s     121   317 s      86   314 s
 16        56    50 s      41    33 s      61   733 s      41   166 s
 32        29   117 s      20    15 s      28  1680 s      21    79 s
 64        13  2633 s      10     6 s      14  3695 s      10    25 s

Table 1. Points close to a circle

Figure 3 shows a subset of $X_1$ (the crosses) and its valid representatives
(the dots) w.r.t. the tolerance $\varepsilon = (16, 16)$.




Figure 3. Valid representatives of Xi 



The computational timings can be drastically reduced if we perform a grid
procedure before applying AA or DA (see Section 3.3). Let us consider two cases
where computation time was high: AA with $\varepsilon = 64$, and DA with
$\varepsilon = 2$. In the case AA with $\varepsilon = 64$, we make a first
reduction of the data using a grid whose balls have a weighted radius of 1/4;
the computation takes 0.14 seconds and produces 48 points. Now AA is applied to
this result, and produces an output of 13 points in 0.01 seconds, overall far
faster than applying AA directly. However, the final result is less accurate
than that obtained by applying AA directly. The same remarks hold for the test
with DA and $\varepsilon = 2$: using a grid whose balls have a weighted radius
of 1/2 we obtain 1657 points in 0.2 seconds; then the execution of DA on this
output takes 83 seconds to return 466 points. Once again, a drastic reduction
in time at the cost of a lower quality result.



Thinning Out Redundant Empirical Data 






Example 8. Example of the "zip"

This first example illustrates the necessity of the test at Step AA5 of AA.
Indeed, if the condition is not checked the algorithm builds a partition
containing sets that are not collapsable.

Let $X^\varepsilon$ be a set of empirical points whose tolerance is
$\varepsilon = (2.199, 2.199)$ and whose set of specified values
$X \subset \mathbb{R}^2$ is given by:

$$X = \{(0.1,2), (2,0), (4.2,0), (6.4,0), (8.6,0), (3.1,3), (5.3,3), (7.5,3)\}$$

Applying AA to the set $X^\varepsilon$ we obtain the following partition of $X$

$$\{\{(0.1,2), (3.1,3)\}, \{(2,0), (4.2,0)\}, \{(6.4,0), (8.6,0)\}, \{(5.3,3), (7.5,3)\}\}$$

for which the set of specified values of the valid representatives is

$$Y = \{(1.6,2.5), (3.1,0), (7.5,0), (6.4,3)\}$$

However, if we check only the distance between the centroids in step AA5, all
the elements of $X^\varepsilon$ are placed in a single set which is obviously
not collapsable.
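The claims of this example are easy to check numerically. The short Python
sketch below (helper names are hypothetical) confirms that each set of the
stated partition satisfies condition (2.1), that the listed representatives
are the centroids, and that the whole set is not collapsable.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def wdist(p, q, eps):
    return math.sqrt(sum(((a - b) / e) ** 2 for a, b, e in zip(p, q, eps)))

def is_collapsable(points, eps):
    q = centroid(points)
    return all(wdist(p, q, eps) < 1 for p in points)

eps = (2.199, 2.199)
partition = [
    [(0.1, 2), (3.1, 3)],
    [(2, 0), (4.2, 0)],
    [(6.4, 0), (8.6, 0)],
    [(5.3, 3), (7.5, 3)],
]
Y = [(1.6, 2.5), (3.1, 0), (7.5, 0), (6.4, 3)]
whole = [p for block in partition for p in block]

print([is_collapsable(block, eps) for block in partition])  # [True, True, True, True]
print(is_collapsable(whole, eps))                           # False
```

For the whole set the centroid is (4.65, 1.375), and the point (0.1, 2) lies at
weighted distance greater than 2 from it, well outside the bound of (2.1).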

Example 9. Example of the "three-pointed star"

We have seen that AA always produces a partition into collapsable sets such
that no pair can be unified into a collapsable set. In most cases the partition
produced by DA also enjoys this property; however, this is not true in general.
Such a situation is shown in this example.

Let $X^\varepsilon$ be a set of 6 empirical points whose tolerance is
$\varepsilon = (1, 1)$ and whose set of specified values
$X \subset \mathbb{R}^2$ is given by:

$$X = \{(0.577,0.99), (0.577,-0.99), (0,0.0001), (0,0), (-1.1551,0), (-1.155,0)\}$$

Applying both AA and DA we obtain the two different partitions $\mathcal{L}_A$
and $\mathcal{L}_D$:

$$\mathcal{L}_A = \{\{(0.577,-0.99)\},
                   \{(0.577,0.99), (0,0.0001), (0,0)\},
                   \{(-1.1551,0), (-1.155,0)\}\}$$

$$\mathcal{L}_D = \{\{(0.577,-0.99)\},
                   \{(0.577,0.99)\},
                   \{(0,0.0001), (0,0), (-1.1551,0), (-1.155,0)\}\}$$

associated to the valid representatives whose specified values are

$$Y_A = \{(0.577,-0.99), (0.192333,0.330033), (-1.15505,0)\}$$
$$Y_D = \{(0.577,0.99), (0.577,-0.99), (-0.577525,0.000025)\}$$

respectively. It is trivial to verify that the elements of
$\mathcal{L}_A^\varepsilon$ are pairwise not unifiable into a collapsable set,
while the same property does not hold for the partition
$\mathcal{L}_D^\varepsilon$ since
$\{(0.577,-0.99)^\varepsilon\} \cup \{(0.577,0.99)^\varepsilon\}$ is a
collapsable set.
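The pairwise check mentioned above can be carried out mechanically. In the
following Python sketch (with hypothetical helper names) `unifiable_pairs`
lists the pairs of sets of a partition whose union satisfies condition (2.1).

```python
import math
from itertools import combinations

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def wdist(p, q, eps):
    return math.sqrt(sum(((a - b) / e) ** 2 for a, b, e in zip(p, q, eps)))

def is_collapsable(points, eps):
    q = centroid(points)
    return all(wdist(p, q, eps) < 1 for p in points)

def unifiable_pairs(partition, eps):
    """Index pairs (i, j) whose union is a collapsable set."""
    return [
        (i, j)
        for i, j in combinations(range(len(partition)), 2)
        if is_collapsable(partition[i] + partition[j], eps)
    ]

eps = (1, 1)
L_A = [[(0.577, -0.99)],
       [(0.577, 0.99), (0, 0.0001), (0, 0)],
       [(-1.1551, 0), (-1.155, 0)]]
L_D = [[(0.577, -0.99)],
       [(0.577, 0.99)],
       [(0, 0.0001), (0, 0), (-1.1551, 0), (-1.155, 0)]]

print(unifiable_pairs(L_A, eps))  # []
print(unifiable_pairs(L_D, eps))  # [(0, 1)]
```

The only unifiable pair in the DA partition is the pair of singletons, whose
union has centroid (0.577, 0) at distance 0.99 from both points.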
6. Conclusions

In this paper a new approach to reducing redundancy in sets of noisy data is
described. The key idea is to work with empirical points, i.e. taking into
consideration the componentwise tolerances on the input data. The two algorithms
presented are included in CoCoALib which is available from the web site p/. 

The experimental results point out that it is faster to use DA when the set
of empirical data is dense enough, since only a few splittings of the original
set are needed. Conversely, when the points are well-separated, AA is
preferable, as the final partition consists of a large number of sets and the
algorithm will perform few iterations. The very quick grid method can be used
to estimate the number of sets in the final partition, and thus guide the
choice between AA and DA.